# Load packages
library(forecast)
library(prophet)
library(e1071)
library(tidyverse)
library(lubridate)
library(dplyr)
library(vars)
library(mFilter)
library(fpp3)
library(doParallel)
library(caret)
library(xgboost)
library(urca)  # ur.df() for the ADF unit-root tests below
library(zoo)   # scale_x_yearmon() for the forecast plots

Introduction

Project Goal: Maverik is interested in producing more accurate financial plans and initial ROI documents for future convenience store locations. The goal of this project is to develop a predictive model precise enough to forecast the first-year sales of new stores that Maverik plans to open. This predictive model will be an ensemble of forecasting and supervised regression models designed to provide daily store-level sales forecasts for multiple key product categories, aiding Maverik in financial planning, resource allocation, and ROI calculations for its expansion strategy. The success of this project will be benchmarked against Maverik's existing naive forecasting solution. The ability to accurately forecast store sales will enable Maverik to optimize the expenditure of scarce resources by pursuing the most profitable locations and minimizing misallocation of those resources when opening a new store. This will lead to better investments, less waste, and more reliable financial evaluations.

Business Problems

Maverik aims to open 30 new stores annually and requires an accurate predictive model for the first-year sales to support financial planning and ROI calculations.

Benefits of the Solution

Precise forecasts will enable Maverik to make informed decisions on store locations and resource allocation, to set achievable sales targets, and to track progress against them.

Success Metrics

The solution will be considered a success if it generates forecasts accurate to within 10% of actual sales, can update its forecasts as new data arrive, and is user-friendly and easy to support.
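
As an illustration of the 10% accuracy criterion, the check can be expressed directly over actual and forecast vectors (the numbers here are hypothetical placeholders, not project data):

```r
actual   <- c(1000, 1200, 950)
forecast <- c(1080, 1150, 1020)

pct_error <- abs(forecast - actual) / actual   # per-period absolute % error
all(pct_error <= 0.10)                         # success criterion: every period within 10%
```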

Analytical Approach

Four models will be developed to leverage Maverik's historical sales data: a Vector Autoregressive (VAR) model, Prophet, Support Vector Regression, and Extreme Gradient Boosting. Each model will be tuned to minimize RMSE, which will also be the metric used to select the final model.
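
Since RMSE is the selection metric, a small helper makes the definition concrete (a sketch with hypothetical values; the fitted models below compute errors from their own forecasts):

```r
# Root mean squared error between actual and predicted vectors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(100, 200, 300), c(110, 190, 310))  # hypothetical values -> 10
```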

Vector AutoRegressive Model (VAR)

Preparing data

# Load data
# Load data and parse date columns
t <- read.csv("t_series.csv")
t$open_date <- as.Date(t$open_date)
t$date <- as.Date(t$date)

Splitting data into train and test

set.seed(1234)

# Sample 30 sites for training; the remaining sites form the test set
t_sites <- sample(unique(t$site_id), 30)

train_sales <- t %>% filter(site_id %in% t_sites)
test_sales  <- t %>% filter(!site_id %in% t_sites)

Thirty unique site_id values are sampled for the training data; the remaining sites form the test data.
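
A quick sanity check (purely illustrative) that the sampled training sites and the remaining test sites do not overlap:

```r
# The two site sets should be disjoint by construction
length(intersect(unique(train_sales$site_id), unique(test_sales$site_id))) == 0
```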

Creating time series

Train Data

inside_ts <- ts(train_sales$inside_sales, start = c(2021, 1), end = c(2023, 8), frequency = 12)
food_ts <- ts(train_sales$food_service_sales, start = c(2021, 1), end = c(2023, 8), frequency = 12)
diesel_ts <- ts(train_sales$diesel_sales, start = c(2021, 1), end = c(2023, 8), frequency = 12)
unleaded_ts <- ts(train_sales$unleaded_sales, start = c(2021, 1), end = c(2023, 8), frequency = 12)
plot(cbind(inside_ts, food_ts, diesel_ts, unleaded_ts))

# Combining ts objects
sales_train <- cbind(inside_ts, food_ts, diesel_ts, unleaded_ts)
colnames(sales_train) <- c("inside_ts", "food_ts", "diesel_ts", "unleaded_ts")

The plot above shows both trend and seasonality in the data.

Test Data

inside_ts_test <- ts(test_sales$inside_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
food_ts_test <- ts(test_sales$food_service_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
diesel_ts_test <- ts(test_sales$diesel_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
unleaded_ts_test <- ts(test_sales$unleaded_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)

# Combining ts objects
sales_test <- cbind(inside_ts_test, food_ts_test, diesel_ts_test, unleaded_ts_test)
colnames(sales_test) <- c("inside_ts_test", "food_ts_test", "diesel_ts_test", "unleaded_ts_test")

ACF Plots

Inside Sales

inside.acf <- acf(inside_ts, main = "inside_sales")

adf_inside <- ur.df(inside_ts, type = "trend", selectlags = "AIC")
summary(adf_inside)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -922.13 -246.57  -77.99  280.10 1007.39 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1245.1338   488.3419   2.550   0.0170 *
## z.lag.1       -0.5902     0.2324  -2.540   0.0174 *
## tt            -6.6482     9.9542  -0.668   0.5101  
## z.diff.lag    -0.1391     0.1944  -0.715   0.4809  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.7 on 26 degrees of freedom
## Multiple R-squared:  0.3645, Adjusted R-squared:  0.2912 
## F-statistic: 4.971 on 3 and 26 DF,  p-value: 0.007379
## 
## 
## Value of test-statistic is: -2.5399 2.3029 3.4457 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2  7.02  5.13  4.31
## phi3  9.31  6.73  5.61

The F-statistic (4.971, p = 0.0074) only indicates that the test regression as a whole is significant; the unit-root decision rests on the tau3 statistic. Here tau3 = -2.54 is above the 5% critical value of -3.50, so we fail to reject the null hypothesis of a unit root: inside sales do not appear stationary in levels.

Food Sales

food.acf <- acf(food_ts, main = "food_sales")

adf_food <- ur.df(food_ts, type = "trend", selectlags = "AIC")
summary(adf_food)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.60  -86.44  -25.34   81.40  398.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 465.9202   177.7706   2.621   0.0145 *
## z.lag.1      -0.6293     0.2443  -2.576   0.0160 *
## tt           -2.0140     3.1060  -0.648   0.5224  
## z.diff.lag   -0.1648     0.1923  -0.857   0.3993  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.2 on 26 degrees of freedom
## Multiple R-squared:  0.4065, Adjusted R-squared:  0.338 
## F-statistic: 5.935 on 3 and 26 DF,  p-value: 0.003177
## 
## 
## Value of test-statistic is: -2.5757 2.3881 3.5729 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2  7.02  5.13  4.31
## phi3  9.31  6.73  5.61

The tau3 test statistic is -2.58, above the 5% critical value of -3.50, so we fail to reject the null hypothesis of a unit root: food sales do not appear stationary in levels. (The reported p-value of 0.003177 belongs to the test regression's F-statistic, not to the unit-root test itself.)

Diesel Sales

diesel.acf <- acf(diesel_ts, main = "diesel_sales")

adf_diesel <- ur.df(diesel_ts, type = "trend", selectlags = "AIC")
summary(adf_diesel)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -689.89 -349.60   70.05  337.05  961.09 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3634.19973  930.22811   3.907 0.000596 ***
## z.lag.1       -1.08341    0.27004  -4.012 0.000453 ***
## tt           -63.90783   18.30120  -3.492 0.001731 ** 
## z.diff.lag     0.02066    0.17202   0.120 0.905325    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 462.1 on 26 degrees of freedom
## Multiple R-squared:  0.5307, Adjusted R-squared:  0.4766 
## F-statistic: 9.802 on 3 and 26 DF,  p-value: 0.0001682
## 
## 
## Value of test-statistic is: -4.0121 5.5246 8.0661 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2  7.02  5.13  4.31
## phi3  9.31  6.73  5.61

The tau3 test statistic is -4.01, below the 5% critical value of -3.50 (though above the 1% value of -4.15), so we can reject the null hypothesis of a unit root at the 5% level: diesel sales appear stationary.

Unleaded Sales

unleaded.acf <- acf(unleaded_ts, main = "unleaded_sales")

adf_unleaded <- ur.df(unleaded_ts, type = "trend", selectlags = "AIC")
summary(adf_unleaded)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -900.75 -198.04  -57.67  138.40 1157.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 839.33953  318.40780   2.636   0.0140 *
## z.lag.1      -0.58687    0.21900  -2.680   0.0126 *
## tt           -9.21487    8.91829  -1.033   0.3110  
## z.diff.lag   -0.02816    0.19871  -0.142   0.8884  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 419.8 on 26 degrees of freedom
## Multiple R-squared:  0.3105, Adjusted R-squared:  0.2309 
## F-statistic: 3.902 on 3 and 26 DF,  p-value: 0.01995
## 
## 
## Value of test-statistic is: -2.6798 2.6368 3.9417 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2  7.02  5.13  4.31
## phi3  9.31  6.73  5.61

The tau3 test statistic is -2.68, above the 5% critical value of -3.50, so we fail to reject the null hypothesis of a unit root: unleaded sales do not appear stationary in levels.
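
Summarizing the decision rule used in the four trend-ADF tests above: the unit-root null is rejected only when the tau3 statistic falls below the chosen critical value. A minimal sketch, using the tau3 values reported in the test output:

```r
# Decision rule for the trend-ADF tests: compare tau3 with the
# critical value (-3.50 at the 5% level)
adf_decision <- function(tau, crit = -3.50) {
  if (tau < crit) "reject unit root (stationary)" else "fail to reject (non-stationary)"
}

adf_decision(-2.5399)  # inside sales
adf_decision(-2.5757)  # food sales
adf_decision(-4.0121)  # diesel sales
adf_decision(-2.6798)  # unleaded sales
```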

Model Fitting

Choosing lags to implement VAR model

info_sales_train <- VARselect(sales_train, lag.max = 10, type = "none")
info_sales_train$selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      6      6      6      5

The VARselect function applied to the sales_train data suggests a lag order of 6 under the AIC, HQ, and SC criteria, while FPE suggests 5. The choice of lag order balances model complexity against goodness of fit; given the short sample, the more parsimonious lag of 5 is used below.

Fitting VAR model on Train data

sales_train_est <- VAR(sales_train, p = 5, type = "none", season = NULL, exog = NULL)
summary(sales_train_est)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: inside_ts, food_ts, diesel_ts, unleaded_ts 
## Deterministic variables: none 
## Sample size: 27 
## Log Likelihood: -636.192 
## Roots of the characteristic polynomial:
## 1.052 1.052 1.042 1.042 0.9984 0.9569 0.9569 0.9454 0.9454 0.9398 0.9057 0.9057 0.8899 0.8899 0.8184 0.8046 0.8046 0.6965 0.6965 0.3019
## Call:
## VAR(y = sales_train, p = 5, type = "none", exogen = NULL)
## 
## 
## Estimation results for equation inside_ts: 
## ========================================== 
## inside_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5 
## 
##                 Estimate Std. Error t value Pr(>|t|)
## inside_ts.l1    2.525123   1.947113   1.297    0.236
## food_ts.l1     -6.759437   3.970384  -1.702    0.132
## diesel_ts.l1   -0.136821   0.402628  -0.340    0.744
## unleaded_ts.l1 -0.446114   1.470191  -0.303    0.770
## inside_ts.l2   -0.721842   1.631657  -0.442    0.672
## food_ts.l2      1.376699   4.102316   0.336    0.747
## diesel_ts.l2    0.007008   0.459072   0.015    0.988
## unleaded_ts.l2  0.481895   0.908350   0.531    0.612
## inside_ts.l3    1.404188   1.525867   0.920    0.388
## food_ts.l3     -1.215320   3.772417  -0.322    0.757
## diesel_ts.l3    0.502640   0.356510   1.410    0.201
## unleaded_ts.l3 -1.275459   0.835377  -1.527    0.171
## inside_ts.l4   -1.342801   1.702937  -0.789    0.456
## food_ts.l4      6.278383   4.843092   1.296    0.236
## diesel_ts.l4   -0.136371   0.312869  -0.436    0.676
## unleaded_ts.l4  0.026408   0.972489   0.027    0.979
## inside_ts.l5    0.614122   2.172163   0.283    0.786
## food_ts.l5     -1.507211   6.351803  -0.237    0.819
## diesel_ts.l5    0.107944   0.342509   0.315    0.762
## unleaded_ts.l5 -0.827422   0.909132  -0.910    0.393
## 
## 
## Residual standard error: 566.5 on 7 degrees of freedom
## Multiple R-Squared: 0.9793,  Adjusted R-squared: 0.9201 
## F-statistic: 16.55 on 20 and 7 DF,  p-value: 0.0004479 
## 
## 
## Estimation results for equation food_ts: 
## ======================================== 
## food_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5 
## 
##                Estimate Std. Error t value Pr(>|t|)  
## inside_ts.l1    0.88242    0.54083   1.632   0.1468  
## food_ts.l1     -2.36804    1.10282  -2.147   0.0689 .
## diesel_ts.l1   -0.04512    0.11183  -0.403   0.6987  
## unleaded_ts.l1 -0.15227    0.40836  -0.373   0.7203  
## inside_ts.l2   -0.16511    0.45321  -0.364   0.7264  
## food_ts.l2      0.26052    1.13947   0.229   0.8257  
## diesel_ts.l2    0.02320    0.12751   0.182   0.8608  
## unleaded_ts.l2  0.12919    0.25230   0.512   0.6244  
## inside_ts.l3    0.47651    0.42383   1.124   0.2980  
## food_ts.l3     -0.52720    1.04783  -0.503   0.6303  
## diesel_ts.l3    0.16004    0.09902   1.616   0.1501  
## unleaded_ts.l3 -0.37796    0.23204  -1.629   0.1474  
## inside_ts.l4   -0.40542    0.47301  -0.857   0.4198  
## food_ts.l4      2.11883    1.34522   1.575   0.1592  
## diesel_ts.l4   -0.04966    0.08690  -0.571   0.5856  
## unleaded_ts.l4 -0.05502    0.27012  -0.204   0.8444  
## inside_ts.l5    0.34174    0.60334   0.566   0.5888  
## food_ts.l5     -0.69921    1.76429  -0.396   0.7037  
## diesel_ts.l5    0.01537    0.09514   0.162   0.8762  
## unleaded_ts.l5 -0.30146    0.25252  -1.194   0.2714  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 157.3 on 7 degrees of freedom
## Multiple R-Squared: 0.9873,  Adjusted R-squared: 0.9509 
## F-statistic: 27.15 on 20 and 7 DF,  p-value: 8.623e-05 
## 
## 
## Estimation results for equation diesel_ts: 
## ========================================== 
## diesel_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5 
## 
##                Estimate Std. Error t value Pr(>|t|)  
## inside_ts.l1   -1.16679    1.55247  -0.752   0.4768  
## food_ts.l1      0.83785    3.16566   0.265   0.7989  
## diesel_ts.l1    0.70268    0.32102   2.189   0.0648 .
## unleaded_ts.l1  1.02462    1.17221   0.874   0.4110  
## inside_ts.l2    1.65641    1.30095   1.273   0.2436  
## food_ts.l2     -2.16548    3.27085  -0.662   0.5291  
## diesel_ts.l2    0.25308    0.36603   0.691   0.5116  
## unleaded_ts.l2 -1.14921    0.72424  -1.587   0.1566  
## inside_ts.l3    1.12302    1.21660   0.923   0.3867  
## food_ts.l3     -2.65856    3.00782  -0.884   0.4061  
## diesel_ts.l3   -0.63980    0.28425  -2.251   0.0591 .
## unleaded_ts.l3 -0.62343    0.66606  -0.936   0.3804  
## inside_ts.l4    0.21643    1.35778   0.159   0.8779  
## food_ts.l4     -2.02441    3.86148  -0.524   0.6163  
## diesel_ts.l4    0.25271    0.24946   1.013   0.3448  
## unleaded_ts.l4  0.09233    0.77538   0.119   0.9086  
## inside_ts.l5   -1.47739    1.73191  -0.853   0.4219  
## food_ts.l5      5.84095    5.06441   1.153   0.2866  
## diesel_ts.l5    0.38265    0.27309   1.401   0.2039  
## unleaded_ts.l5  0.17946    0.72487   0.248   0.8116  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 451.7 on 7 degrees of freedom
## Multiple R-Squared: 0.9902,  Adjusted R-squared: 0.9622 
## F-statistic: 35.41 on 20 and 7 DF,  p-value: 3.509e-05 
## 
## 
## Estimation results for equation unleaded_ts: 
## ============================================ 
## unleaded_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5 
## 
##                 Estimate Std. Error t value Pr(>|t|)
## inside_ts.l1    1.765755   1.721589   1.026    0.339
## food_ts.l1     -4.685155   3.510515  -1.335    0.224
## diesel_ts.l1   -0.275173   0.355994  -0.773    0.465
## unleaded_ts.l1 -0.304359   1.299906  -0.234    0.822
## inside_ts.l2   -1.529180   1.442671  -1.060    0.324
## food_ts.l2      2.612443   3.627166   0.720    0.495
## diesel_ts.l2    0.058622   0.405900   0.144    0.889
## unleaded_ts.l2  0.886723   0.803140   1.104    0.306
## inside_ts.l3    1.297470   1.349134   0.962    0.368
## food_ts.l3     -1.613280   3.335478  -0.484    0.643
## diesel_ts.l3    0.586694   0.315218   1.861    0.105
## unleaded_ts.l3 -0.952015   0.738620  -1.289    0.238
## inside_ts.l4   -0.447666   1.505695  -0.297    0.775
## food_ts.l4      3.432805   4.282142   0.802    0.449
## diesel_ts.l4    0.006449   0.276631   0.023    0.982
## unleaded_ts.l4  0.015899   0.859850   0.018    0.986
## inside_ts.l5    0.591662   1.920573   0.308    0.767
## food_ts.l5     -1.998878   5.616106  -0.356    0.732
## diesel_ts.l5   -0.067544   0.302838  -0.223    0.830
## unleaded_ts.l5 -0.710057   0.803832  -0.883    0.406
## 
## 
## Residual standard error: 500.9 on 7 degrees of freedom
## Multiple R-Squared: 0.9597,  Adjusted R-squared: 0.8446 
## F-statistic: 8.335 on 20 and 7 DF,  p-value: 0.004005 
## 
## 
## 
## Covariance matrix of residuals:
##             inside_ts food_ts diesel_ts unleaded_ts
## inside_ts      320883   86659    -43135      273510
## food_ts         86659   24757     -2911       71175
## diesel_ts      -43135   -2911    203981      -53119
## unleaded_ts    273510   71175    -53119      250848
## 
## Correlation matrix of residuals:
##             inside_ts  food_ts diesel_ts unleaded_ts
## inside_ts      1.0000  0.97228  -0.16860      0.9640
## food_ts        0.9723  1.00000  -0.04097      0.9032
## diesel_ts     -0.1686 -0.04097   1.00000     -0.2348
## unleaded_ts    0.9640  0.90318  -0.23483      1.0000

The residual correlation matrix shows that the residuals of inside_ts, food_ts, and unleaded_ts are strongly positively correlated with one another (0.90 to 0.97): the inside_ts/food_ts pair is highest at 0.972, with inside_ts/unleaded_ts close behind at 0.964. The diesel_ts residuals, by contrast, are only weakly (and negatively) correlated with the other three.

These high residual correlations suggest that inside, food, and unleaded sales share a great deal of common information and move together over time, while diesel sales behave somewhat independently. The VAR model can exploit these current and past co-movements to forecast future values of all four series.

Performing Portmanteau-test on model Residuals for train data

train_serial <- serial.test(sales_train_est, lags.pt = 19, type = "PT.asymptotic")
train_serial
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object sales_train_est
## Chi-squared = 249.51, df = 224, p-value = 0.1163

For the VAR model of the sales data, the Portmanteau test statistic is 249.51 with 224 degrees of freedom and a p-value of 0.1163, which is greater than the 0.05 significance level. We therefore fail to reject the null hypothesis of no autocorrelation in the residuals.

The absence of detectable residual autocorrelation is a good sign: it suggests the VAR model has adequately captured the serial dependence in the data.

Residual plots for each sales metrics

plot(train_serial, names = "inside_ts")

plot(train_serial, names = "food_ts")

plot(train_serial, names = "unleaded_ts")

plot(train_serial, names = "diesel_ts")

The residual plots show that the residuals of the VAR model are randomly scattered around the zero line, with no obvious patterns or outliers. This suggests that the VAR model is a good fit for the data.

Granger causality tests to check for causal relationships between the variables

train_cause_inside_ts <- causality(sales_train_est, cause = "inside_ts")
train_cause_inside_ts
## $Granger
## 
##  Granger causality H0: inside_ts do not Granger-cause food_ts diesel_ts
##  unleaded_ts
## 
## data:  VAR object sales_train_est
## F-Test = 1.0403, df1 = 15, df2 = 28, p-value = 0.4473
## 
## 
## $Instant
## 
##  H0: No instantaneous causality between: inside_ts and food_ts diesel_ts
##  unleaded_ts
## 
## data:  VAR object sales_train_est
## Chi-squared = 13.412, df = 3, p-value = 0.003825
train_cause_food_ts <- causality(sales_train_est, cause = "food_ts")
train_cause_food_ts
## $Granger
## 
##  Granger causality H0: food_ts do not Granger-cause inside_ts diesel_ts
##  unleaded_ts
## 
## data:  VAR object sales_train_est
## F-Test = 1.1496, df1 = 15, df2 = 28, p-value = 0.3623
## 
## 
## $Instant
## 
##  H0: No instantaneous causality between: food_ts and inside_ts diesel_ts
##  unleaded_ts
## 
## data:  VAR object sales_train_est
## Chi-squared = 13.298, df = 3, p-value = 0.004035
train_cause_diesel_ts <- causality(sales_train_est, cause = "diesel_ts")
train_cause_diesel_ts
## $Granger
## 
##  Granger causality H0: diesel_ts do not Granger-cause inside_ts food_ts
##  unleaded_ts
## 
## data:  VAR object sales_train_est
## F-Test = 2.9168, df1 = 15, df2 = 28, p-value = 0.006992
## 
## 
## $Instant
## 
##  H0: No instantaneous causality between: diesel_ts and inside_ts food_ts
##  unleaded_ts
## 
## data:  VAR object sales_train_est
## Chi-squared = 6.318, df = 3, p-value = 0.09712
train_cause_unleaded_ts <- causality(sales_train_est, cause = "unleaded_ts")
train_cause_unleaded_ts
## $Granger
## 
##  Granger causality H0: unleaded_ts do not Granger-cause inside_ts
##  food_ts diesel_ts
## 
## data:  VAR object sales_train_est
## F-Test = 1.0638, df1 = 15, df2 = 28, p-value = 0.4281
## 
## 
## $Instant
## 
##  H0: No instantaneous causality between: unleaded_ts and inside_ts
##  food_ts diesel_ts
## 
## data:  VAR object sales_train_est
## Chi-squared = 13.159, df = 3, p-value = 0.004305

The tests show that diesel_ts Granger-causes the other three series (p = 0.007), while none of the other series individually Granger-causes the rest. The instantaneous-causality null is rejected for inside_ts, food_ts, and unleaded_ts (p < 0.01 in each case), indicating significant instantaneous causality between each of these series and the others; only diesel_ts shows no significant instantaneous causality (p = 0.097).

Forecasting

# Forecast over the test horizon (predict.varest has no dumvar.forecast argument)
sales_test_forecast <- predict(sales_train_est, n.ahead = nrow(sales_test), ci = 0.95)
# Extract forecasted values for each variable in test data
inside_forecast_values <- sales_test_forecast$fcst$inside_ts
food_forecast_values <- sales_test_forecast$fcst$food_ts
diesel_forecast_values <- sales_test_forecast$fcst$diesel_ts
unleaded_forecast_values <- sales_test_forecast$fcst$unleaded_ts

# Create time series for each forecasted variable in test data
inside_forecast_ts <- ts(inside_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
food_forecast_ts <- ts(food_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
diesel_forecast_ts <- ts(diesel_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
unleaded_forecast_ts <- ts(unleaded_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)

# Create and display plots for each forecasted variable with months on x-axis on test data
autoplot(inside_forecast_ts, xlab = "Year", ylab = "Inside Sales Forecast", main = "Inside Sales Forecast on Test Data") +
  scale_x_yearmon()  

autoplot(food_forecast_ts, xlab = "Year", ylab = "Food Sales Forecast", main = "Food Sales Forecast on Test Data") +
  scale_x_yearmon()

autoplot(diesel_forecast_ts, xlab = "Year", ylab = "Diesel Sales Forecast", main = "Diesel Sales Forecast on Test Data") +
  scale_x_yearmon()

autoplot(unleaded_forecast_ts, xlab = "Year", ylab = "Unleaded Sales Forecast", main = "Unleaded Sales Forecast on Test Data") +
  scale_x_yearmon()

All four plots show the forecasts over the test-data horizon for the various sales metrics. The forecast line tracks the level and seasonal pattern of the historical series, suggesting the model is producing plausible forecasts.

The plot also shows two other lines: the upper and lower confidence intervals. The confidence intervals indicate the range of values within which the actual sales are likely to fall. The narrower the confidence intervals, the more confident we can be in the forecast.

The confidence intervals in the plot are relatively narrow, suggesting that we can be fairly confident in the forecast.

Overall, the plot shows that the VAR model is doing a good job of forecasting on test data.
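
As an alternative visualization, the vars package also provides fanchart() for the varprd object returned by predict(), which shades the forecast intervals as a fan; a minimal sketch:

```r
# Fan chart of the VAR forecasts (here: the inside sales panel)
fanchart(sales_test_forecast, names = "inside_ts")
```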

Model Evaluation

Train Data

# Generate forecasts for the training data
train_forecast <- predict(sales_train_est, n.ahead = nrow(sales_train))

# Extract the actual and forecasted values for each response variable - train data
actual_train_inside <- sales_train[,"inside_ts"]
actual_train_food <- sales_train[,"food_ts"]
actual_train_diesel <- sales_train[,"diesel_ts"]
actual_train_unleaded <- sales_train[,"unleaded_ts"]

forecasted_train_inside <- train_forecast$fcst$inside_ts[,"fcst"]
forecasted_train_food <- train_forecast$fcst$food_ts[,"fcst"]
forecasted_train_diesel <- train_forecast$fcst$diesel_ts[,"fcst"]
forecasted_train_unleaded <- train_forecast$fcst$unleaded_ts[,"fcst"]

# Calculate residuals (errors) for each response variable - training dataset
train_inside_errors <- actual_train_inside - forecasted_train_inside
train_food_errors <- actual_train_food - forecasted_train_food
train_diesel_errors <- actual_train_diesel - forecasted_train_diesel
train_unleaded_errors <- actual_train_unleaded - forecasted_train_unleaded

# Calculate MAE for each response variable - training dataset
mae_train_inside <- mean(abs(train_inside_errors))
mae_train_food <- mean(abs(train_food_errors))
mae_train_diesel <- mean(abs(train_diesel_errors))
mae_train_unleaded <- mean(abs(train_unleaded_errors))

# Calculate MSE for each response variable - training dataset
mse_train_inside <- mean(train_inside_errors^2)
mse_train_food <- mean(train_food_errors^2)
mse_train_diesel <- mean(train_diesel_errors^2)
mse_train_unleaded <- mean(train_unleaded_errors^2)

# Calculate RMSE for each response variable - training dataset
rmse_train_inside <- sqrt(mse_train_inside)
rmse_train_food <- sqrt(mse_train_food)
rmse_train_diesel <- sqrt(mse_train_diesel)
rmse_train_unleaded <- sqrt(mse_train_unleaded)

# Create a data frame with the metrics
error_metric_train <- data.frame(
  Metric = c("inside_ts - Training", "food_ts - Training", "diesel_ts - Training", "unleaded_ts - Training"
             ),
  MAE = c(mae_train_inside, mae_train_food, mae_train_diesel, mae_train_unleaded),
  MSE = c(mse_train_inside,mse_train_food,mse_train_diesel,mse_train_unleaded),
  RMSE = c(rmse_train_inside,rmse_train_food,rmse_train_diesel,rmse_train_unleaded)
)

# Print the data frame
print(error_metric_train)
##                   Metric       MAE       MSE      RMSE
## 1   inside_ts - Training 1412.0495 3192019.0 1786.6222
## 2     food_ts - Training  460.1878  362257.5  601.8783
## 3   diesel_ts - Training 1037.9166 1674808.2 1294.1438
## 4 unleaded_ts - Training 1144.2826 2105038.0 1450.8749

Test Data

# Extract the actual and forecasted values for each response variable - test data
actual_test_inside <- sales_test[,"inside_ts_test"]
actual_test_food <- sales_test[,"food_ts_test"]
actual_test_diesel <- sales_test[,"diesel_ts_test"]
actual_test_unleaded <- sales_test[,"unleaded_ts_test"]

forecasted_test_inside <- inside_forecast_values[,"fcst"]
forecasted_test_food <- food_forecast_values[,"fcst"]
forecasted_test_diesel <- diesel_forecast_values[,"fcst"]
forecasted_test_unleaded <- unleaded_forecast_values[,"fcst"]

# Calculate residuals (errors) for each response variable - test dataset
test_inside_errors <- actual_test_inside - forecasted_test_inside
test_food_errors <- actual_test_food - forecasted_test_food
test_diesel_errors <- actual_test_diesel - forecasted_test_diesel
test_unleaded_errors <- actual_test_unleaded - forecasted_test_unleaded

# Calculate MAE for each response variable - testing dataset
mae_test_inside <- mean(abs(test_inside_errors))
mae_test_food <- mean(abs(test_food_errors))
mae_test_diesel <- mean(abs(test_diesel_errors))
mae_test_unleaded <- mean(abs(test_unleaded_errors))

# Calculate MSE for each response variable - testing dataset
mse_test_inside <- mean(test_inside_errors^2)
mse_test_food <- mean(test_food_errors^2)
mse_test_diesel <- mean(test_diesel_errors^2)
mse_test_unleaded <- mean(test_unleaded_errors^2)

# Calculate RMSE for each response variable - testing dataset
rmse_test_inside <- sqrt(mse_test_inside)
rmse_test_food <- sqrt(mse_test_food)
rmse_test_diesel <- sqrt(mse_test_diesel)
rmse_test_unleaded <- sqrt(mse_test_unleaded)

# Create a data frame with the metrics
error_metric_test <- data.frame(
  Metric = c("inside_ts - Test", "food_ts - Test", "diesel_ts - Test", "unleaded_ts - Test"
             ),
  MAE = c(mae_test_inside, mae_test_food, mae_test_diesel, mae_test_unleaded),
  MSE = c(mse_test_inside,mse_test_food,mse_test_diesel,mse_test_unleaded),
  RMSE = c(rmse_test_inside,rmse_test_food,rmse_test_diesel,rmse_test_unleaded)
)

# Print the data frame
print(error_metric_test)
##               Metric       MAE       MSE      RMSE
## 1   inside_ts - Test 1728.8151 4822152.8 2195.9401
## 2     food_ts - Test  595.9288  581166.1  762.3425
## 3   diesel_ts - Test  917.9421 1277537.1 1130.2819
## 4 unleaded_ts - Test 1823.7715 5632228.0 2373.2315

The training MAE, MSE, and RMSE values for inside_ts, food_ts, and unleaded_ts are lower than the corresponding test values, suggesting that the model overfits the training data and does not generalize well to unseen sites; diesel_ts is the exception, with a slightly lower test RMSE than training RMSE.

On the training data, inside_ts has the highest errors, followed by unleaded_ts, diesel_ts, and food_ts; on the test data, unleaded_ts has the highest errors, followed by inside_ts, diesel_ts, and food_ts. Overall, the VAR model is less accurate at forecasting inside_ts and unleaded_ts than diesel_ts and food_ts.
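
To quantify the generalization gap, the test and training RMSEs can be compared directly (values copied from the printed tables above):

```r
# Ratio of test RMSE to training RMSE per series; values above 1 indicate
# degradation on the held-out sites
gap <- data.frame(
  series     = c("inside_ts", "food_ts", "diesel_ts", "unleaded_ts"),
  rmse_train = c(1786.6222, 601.8783, 1294.1438, 1450.8749),
  rmse_test  = c(2195.9401, 762.3425, 1130.2819, 2373.2315)
)
gap$ratio <- round(gap$rmse_test / gap$rmse_train, 2)
gap
```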

Prophet Model

Preparing Data

# Load raw data
qualitative <- read.csv("qualitative_data_msba.csv")
time_series <- read.csv("time_series_data_msba.csv")

# Fixed data: exclude site 23065 from the time series
ts_data <- read.csv("t_series.csv")
t_series <- ts_data %>% filter(site_id != 23065)

fixed_cnames <- colnames(read.csv("q_data.csv") %>%
                           mutate(men_toilet_count = NA,
                                  .after = self_check_out) %>%
                           select(-rv_fueling_positions))[c(1:39, 41, 42, 40, 43:52)]

q_data <- read.csv("qualitative_data_msba.csv") %>%
  select(-c(1, `RV_Lanes_Fueling_Positions`, `Hi_Flow_Lanes_Fueling_Positions`)) %>%
  mutate(
    across(
      where(~any(grepl("^N/?A$", ., ignore.case = TRUE))),
      ~replace(., grepl("^N/?A$", ., ignore.case = TRUE), "None")
    )
  ) %>%
  rename_with(~fixed_cnames)

merged_data <- t_series %>%
  left_join(q_data,
            "site_id")

Fitting the Model

# Rename columns if they are named differently
model_df <- merged_data %>%
  rename(ds = date, y = inside_sales)

colnames(model_df)
##  [1] "open_date"                     "ds"                           
##  [3] "week_id"                       "day_of_week"                  
##  [5] "holiday"                       "day_type"                     
##  [7] "y"                             "food_service_sales"           
##  [9] "diesel_sales"                  "unleaded_sales"               
## [11] "site_id"                       "open_year"                    
## [13] "square_feet"                   "fDoor_count"                  
## [15] "years_since_last_project"      "parking_spaces"               
## [17] "lottery"                       "freal"                        
## [19] "bonfire_grill"                 "pizza"                        
## [21] "cinnabon"                      "godfathers_pizza"             
## [23] "ethanol_free"                  "diesel"                       
## [25] "hi_flow_lanes"                 "rv_lanes"                     
## [27] "hi_flow_rv_lanes"              "def"                          
## [29] "cat_scales"                    "car_wash"                     
## [31] "ev_charging"                   "rv_dumps"                     
## [33] "propane"                       "x1_mile_pop"                  
## [35] "x1_mile_emp"                   "x1_mile_income"               
## [37] "x1_2_mile_pop"                 "x1_2_mile_emp"                
## [39] "x1_2_mile_income"              "x5_min_pop"                   
## [41] "x5_min_emp"                    "x5_min_inc"                   
## [43] "x7_min_pop"                    "x7_min_emp"                   
## [45] "x7_min_inc"                    "traditional_fueling_positions"
## [47] "traditional_forecourt_layout"  "traditional_forecourt_stack"  
## [49] "rv_forecourt_layout"           "rv_forecourt_stack"           
## [51] "hi_flow_forecourt_layout"      "hi_flow_forecourt_stack"      
## [53] "hi_flow_fueling_positions"     "rv_lanes_fueling_positions"   
## [55] "hi_flow_rv_forecourt_layout"   "hi_flow_rv_forecourt_stack"   
## [57] "non_24_hour"                   "self_check_out"               
## [59] "men_toilet_count"              "men_urinal_count"             
## [61] "women_toilet_count"            "women_sink_count"
# Splitting point for an 80:20 split
split_point <- floor(0.8 * nrow(model_df))
print(split_point)
## [1] 10833
nrow(merged_data)
## [1] 13542
# Splitting into train and test sets
train_data <- model_df[1:split_point, ]
nrow(train_data)
## [1] 10833
test_data <- model_df[(split_point + 1):nrow(model_df), ]
nrow(test_data)
## [1] 2709
# Fitting the Prophet model on the training data
model <- prophet()
model_fit <- fit.prophet(model, train_data)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# Creating a future dataframe for predictions on the test data
future <- make_future_dataframe(model_fit, periods = nrow(test_data))

# Making predictions on the test data
forecast <- predict(model_fit, future)

# Visualize the forecast
plot(model_fit, forecast)

# Accuracy Metrics
accuracy_metrics <- accuracy(forecast$yhat, test_data$y)
print(accuracy_metrics)
##                 ME     RMSE     MAE  MPE MAPE
## Test set -416.7406 1313.426 1055.11 -Inf  Inf

ME (Mean Error): on the test set, predictions deviate from the actual values by roughly -416.74 units on average. The negative sign indicates that the model overestimates overall.

RMSE (Root Mean Squared Error): the typical magnitude of the model’s prediction errors on the test set is about 1313.43 units.

MAE (Mean Absolute Error): the model’s average absolute deviation from the actual test values is around 1055.11 units.

MPE (Mean Percentage Error) and MAPE (Mean Absolute Percentage Error) express the errors as percentages of the actual values. Both are reported as infinite here, which occurs when at least one actual value is zero, making the percentage error undefined.

Taken together, these metrics indicate how closely the forecasted values align with the observed values.
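The infinite MPE and MAPE can be reproduced on toy vectors: a single zero among the actuals makes the percentage error undefined.

```r
# Toy example: compute the same accuracy metrics by hand
actual <- c(100, 0, 120, 90)   # note the zero actual value
pred   <- c(110, 5, 115, 95)
err    <- actual - pred

mean(err)                 # ME: negative when predictions overshoot on average
sqrt(mean(err^2))         # RMSE
mean(abs(err))            # MAE
mean(100 * err / actual)  # MPE: -Inf, driven by division by the zero actual
```

Filtering out zero-sales days before evaluation, or reporting a scaled error such as MASE, avoids this issue.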

Fitting the Model with Regressors

#library(prophet)

# Create a Prophet model
model_reg <- prophet()

# Add additional regressor variables
model_reg <- add_regressor(model_reg, name = "food_service_sales")
model_reg <- add_regressor(model_reg, name = "diesel_sales")
model_reg <- add_regressor(model_reg, name = "unleaded_sales")

# Fit the model with additional regressors
model_reg <- fit.prophet(model_reg, model_df)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
summary(model_reg)
##                         Length Class      Mode     
## growth                    1    -none-     character
## changepoints             25    POSIXct    numeric  
## n.changepoints            1    -none-     numeric  
## changepoint.range         1    -none-     numeric  
## yearly.seasonality        1    -none-     character
## weekly.seasonality        1    -none-     character
## daily.seasonality         1    -none-     character
## holidays                  0    -none-     NULL     
## seasonality.mode          1    -none-     character
## seasonality.prior.scale   1    -none-     numeric  
## changepoint.prior.scale   1    -none-     numeric  
## holidays.prior.scale      1    -none-     numeric  
## mcmc.samples              1    -none-     numeric  
## interval.width            1    -none-     numeric  
## uncertainty.samples       1    -none-     numeric  
## specified.changepoints    1    -none-     logical  
## start                     1    POSIXct    numeric  
## y.scale                   1    -none-     numeric  
## logistic.floor            1    -none-     logical  
## t.scale                   1    -none-     numeric  
## changepoints.t           25    -none-     numeric  
## seasonalities             2    -none-     list     
## extra_regressors          3    -none-     list     
## country_holidays          0    -none-     NULL     
## stan.fit                  4    -none-     list     
## params                    6    -none-     list     
## history                  65    data.frame list     
## history.dates           947    POSIXct    numeric  
## train.holiday.names       0    -none-     NULL     
## train.component.cols      8    data.frame list     
## component.modes           2    -none-     list     
## fit.kwargs                0    -none-     list
# Make future predictions
future <- make_future_dataframe(model_reg, periods = 365)  # Example: forecast for 365 days

length(model_df$food_service_sales)
## [1] 13542
# Assuming future has fewer rows than model_df$food_service_sales
future$food_service_sales <- head(model_df$food_service_sales, nrow(future))
future$diesel_sales <- head(model_df$diesel_sales, nrow(future))
future$unleaded_sales <- head(model_df$unleaded_sales, nrow(future))

#future$food_service_sales <- model_df$food_service_sales
#future$diesel_sales <- model_df$diesel_sales 
#future$unleaded_sales <- model_df$unleaded_sales

forecast <- predict(model_reg, future)

colnames(forecast)
##  [1] "ds"                              "trend"                          
##  [3] "additive_terms"                  "additive_terms_lower"           
##  [5] "additive_terms_upper"            "diesel_sales"                   
##  [7] "diesel_sales_lower"              "diesel_sales_upper"             
##  [9] "extra_regressors_additive"       "extra_regressors_additive_lower"
## [11] "extra_regressors_additive_upper" "food_service_sales"             
## [13] "food_service_sales_lower"        "food_service_sales_upper"       
## [15] "unleaded_sales"                  "unleaded_sales_lower"           
## [17] "unleaded_sales_upper"            "weekly"                         
## [19] "weekly_lower"                    "weekly_upper"                   
## [21] "yearly"                          "yearly_lower"                   
## [23] "yearly_upper"                    "multiplicative_terms"           
## [25] "multiplicative_terms_lower"      "multiplicative_terms_upper"     
## [27] "yhat_lower"                      "yhat_upper"                     
## [29] "trend_lower"                     "trend_upper"                    
## [31] "yhat"
# Visualize forecast
plot(model_reg, forecast)

By leveraging Prophet’s support for external regressors, this model incorporates food service, diesel, and unleaded sales as additional inputs, which may produce more thorough forecasts by accounting for these influencing factors. Note that Prophet requires future values of each regressor at prediction time; here, historical values are reused as placeholders.
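To compare the plain and regressor-augmented models on equal footing, Prophet’s rolling-origin cross-validation could be applied to each. A sketch, assuming the fitted `model_fit` from above (the window sizes are illustrative):

```r
# Rolling-origin cross-validation (window sizes are illustrative)
df_cv <- cross_validation(
  model_fit,
  initial = 365,    # days of history in the first training window
  period  = 90,     # days between successive cutoffs
  horizon = 90,     # days forecast past each cutoff
  units   = "days"
)

# RMSE, MAE, MAPE, etc. summarised across horizons
head(performance_metrics(df_cv))
```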

Results

The initial Prophet model was fit to the training data and used to forecast the test period. Forecast accuracy was evaluated with Mean Error (ME), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Percentage Error (MPE), and Mean Absolute Percentage Error (MAPE), which together characterize both the direction and magnitude of the forecast errors.

A second, more advanced Prophet model added three regressor variables: “food_service_sales,” “diesel_sales,” and “unleaded_sales.” It was trained on the same dataset, combining historical sales with the regressor series to produce predictions for upcoming periods.

These code sections demonstrate time series forecasting with Prophet both with and without extra regressor variables. Incorporating regressors should, in principle, make the forecasts more accurate and complete by accounting for a variety of influencing factors.

Support Vector Regression Model (SVR)

Data Preparation

# Load time series data

tdata <- read.csv("time_series_data_msba.csv")

# Check for NAs in the time series data

colSums(is.na(tdata))
##                                  X capital_projects.soft_opening_date 
##                                  0                                  0 
##         calendar.calendar_day_date   calendar.fiscal_week_id_for_year 
##                                  0                                  0 
##               calendar.day_of_week       calendar_information.holiday 
##                                  0                                  0 
##   calendar_information.type_of_day   daily_yoy_ndt.total_inside_sales 
##                                  0                                  0 
##   daily_yoy_ndt.total_food_service                             diesel 
##                                  0                                  0 
##                           unleaded                       site_id_msba 
##                                  0                                  0
# View Structure of the time series data 
str(tdata)
## 'data.frame':    13908 obs. of  12 variables:
##  $ X                                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ capital_projects.soft_opening_date: chr  "2022-06-14" "2022-06-14" "2022-06-14" "2022-06-14" ...
##  $ calendar.calendar_day_date        : chr  "2022-06-17" "2022-06-22" "2022-06-23" "2022-06-26" ...
##  $ calendar.fiscal_week_id_for_year  : int  25 25 25 26 26 26 27 27 27 28 ...
##  $ calendar.day_of_week              : chr  "Friday" "Wednesday" "Thursday" "Sunday" ...
##  $ calendar_information.holiday      : chr  "NONE" "NONE" "NONE" "NONE" ...
##  $ calendar_information.type_of_day  : chr  "WEEKDAY" "WEEKDAY" "WEEKDAY" "WEEKEND" ...
##  $ daily_yoy_ndt.total_inside_sales  : num  2168 2052 2258 1521 1898 ...
##  $ daily_yoy_ndt.total_food_service  : num  862 808 966 542 771 ...
##  $ diesel                            : num  723 730 896 584 852 ...
##  $ unleaded                          : num  1425 1436 1594 1125 1640 ...
##  $ site_id_msba                      : int  24535 24535 24535 24535 24535 24535 24535 24535 24535 24535 ...
# Change the names of the time series data columns and replace with the part after the period(.)

colnames <- colnames(tdata)
new_col_names <- sub(".+\\.", "", colnames)
colnames(tdata) <- new_col_names
#Convert 'calendar_day_date' to Date format

tdata$calendar_day_date <- as.Date(tdata$calendar_day_date)
## Encode the categorical data 
convert_to_factors <- function(data) {
  # Get the column names of categorical variables
  categorical_columns <- sapply(data, function(col) is.factor(col) || is.character(col))
  
  # Convert categorical columns to factors
  data[categorical_columns] <- lapply(data[categorical_columns], as.factor)
  
  return(data)
}

ts_data <- convert_to_factors(tdata)


# Remove columns which will not affect the time series analysis
ts_data <- ts_data %>% select(-c(`site_id_msba`, `soft_opening_date`))
# Scaling Numeric and Integer data columns

# Identify numeric and integer columns
numeric_columns <- sapply(ts_data, is.numeric)
integer_columns <- sapply(ts_data, is.integer)

# Combine numeric and integer columns
columns_to_scale <- numeric_columns | integer_columns

# Apply scaling to selected columns
ts_data[columns_to_scale] <- lapply(ts_data[columns_to_scale], scale)
# Split the data into train and test sets. 

set.seed(123)
num_rows <- nrow(ts_data)
train_index <- sample(1:num_rows, 0.8 * num_rows)

train_data <- ts_data[train_index, ]
test_data <- ts_data[-train_index, ]

Model Selection - Support Vector Regression & Hyperparameter Tuning

# Create 4 Support Vector Models for each of the target variables

svr_total_inside_sales <- svm(total_inside_sales~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)

svr_total_food_service <- svm(total_food_service~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)

svr_diesel <- svm(diesel~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)

svr_unleaded <- svm(unleaded~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)
# Perform prediction of each of 4 target variables using test data

predictions_total_inside_sales <- predict(svr_total_inside_sales, test_data)

predictions_total_food_service <- predict(svr_total_food_service, test_data)

predictions_diesel <- predict(svr_diesel, test_data)

predictions_unleaded <- predict(svr_unleaded, test_data)

Support Vector Regression fits a function that lies within a user-defined error tolerance (the “epsilon” parameter) of as many data points as possible in a high-dimensional feature space, while keeping the function as flat as possible; points falling outside the epsilon tube incur a penalty controlled by the cost parameter. Applied to this multivariate setting, SVR uses multiple features as inputs to predict each target variable rather than a single independent variable. Unlike linear regression, SVR with a radial kernel can also capture non-linear relationships among the data points.
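Rather than fixing epsilon at 0.15, the cost and epsilon parameters can be tuned with e1071’s built-in grid search. A sketch, assuming the `train_data` split from above (the grid values are illustrative):

```r
# Grid search over cost and epsilon with internal cross-validation
set.seed(123)
tuned <- tune(
  svm, total_inside_sales ~ ., data = train_data,
  kernel = "radial",
  ranges = list(cost = c(0.5, 1, 2), epsilon = c(0.05, 0.1, 0.15))
)

tuned$best.parameters          # best (cost, epsilon) pair found
best_svr <- tuned$best.model   # model refit with the best parameters
```

Note that tune() defaults to 10-fold cross-validation on shuffled rows, which, like the random split above, ignores the temporal ordering of the data.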

Evaluating Model Performance using MSE/RMSE/R-Squared

#Evaluating model performance by Calculating MSE

mse_total_inside_sales <- mean((predictions_total_inside_sales - test_data$total_inside_sales)^2)
mse_total_food_service <- mean((predictions_total_food_service - test_data$total_food_service)^2)
mse_diesel <- mean((predictions_diesel - test_data$diesel)^2)
mse_unleaded <- mean((predictions_unleaded - test_data$unleaded)^2)


cat("Mean Squared Error for total_inside_sales:", mse_total_inside_sales, "\n")
## Mean Squared Error for total_inside_sales: 0.1189227
cat("Mean Squared Error for total_food_service:", mse_total_food_service, "\n")
## Mean Squared Error for total_food_service: 0.07742735
cat("Mean Squared Error for diesel:", mse_diesel, "\n")
## Mean Squared Error for diesel: 0.3220052
cat("Mean Squared Error for unleaded:",mse_unleaded, "\n")
## Mean Squared Error for unleaded: 0.6943869
#Evaluating model performance by Calculating RMSE

rmse_total_inside_sales <- sqrt(mean((predictions_total_inside_sales - test_data$total_inside_sales)^2))
rmse_total_food_service <- sqrt(mean((predictions_total_food_service - test_data$total_food_service)^2))
rmse_diesel <- sqrt(mean((predictions_diesel - test_data$diesel)^2))
rmse_unleaded <- sqrt(mean((predictions_unleaded - test_data$unleaded)^2))


cat("Root Mean Squared Error for total_inside_sales:", rmse_total_inside_sales, "\n")
## Root Mean Squared Error for total_inside_sales: 0.3448517
cat("Root Mean Squared Error for total_food_service:", rmse_total_food_service, "\n")
## Root Mean Squared Error for total_food_service: 0.2782577
cat("Root Mean Squared Error for diesel:", rmse_diesel, "\n")
## Root Mean Squared Error for diesel: 0.567455
cat("Root Mean Squared Error for unleaded:", rmse_unleaded, "\n")
## Root Mean Squared Error for unleaded: 0.8332988
#Evaluating model performance by Calculating R-Squared

r2_total_inside_sales <- 1 - sum((test_data$total_inside_sales - predictions_total_inside_sales)^2) / sum((test_data$total_inside_sales - mean(test_data$total_inside_sales))^2)

r2_total_food_service <- 1 - sum((test_data$total_food_service - predictions_total_food_service)^2) / sum((test_data$total_food_service - mean(test_data$total_food_service))^2)

r2_diesel <- 1 - sum((test_data$diesel - predictions_diesel)^2) / sum((test_data$diesel - mean(test_data$diesel))^2)

r2_unleaded <- 1 - sum((test_data$unleaded - predictions_unleaded)^2) / sum((test_data$unleaded - mean(test_data$unleaded))^2)


cat("R-Squared for total_inside_sales:", r2_total_inside_sales, "\n")
## R-Squared for total_inside_sales: 0.8811577
cat("R-Squared for total_food_service:", r2_total_food_service, "\n")
## R-Squared for total_food_service: 0.9205895
cat("R-Squared for diesel:", r2_diesel, "\n")
## R-Squared for diesel: 0.6560366
cat("R-Squared for unleaded:", r2_unleaded, "\n")
## R-Squared for unleaded: 0.3067179

Results

The individual SVR models perform well when considering R-squared. R-squared is highest for total_food_service, and its RMSE is correspondingly the lowest, implying the model reduces error most effectively for that target. The unleaded model is the weakest, with an R-squared of only about 0.31.
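For side-by-side comparison with the VAR error table shown earlier, the SVR test metrics computed above can be collected into a single data frame:

```r
# Collect the SVR test metrics into one table
svr_metrics <- data.frame(
  Target = c("total_inside_sales", "total_food_service", "diesel", "unleaded"),
  MSE  = c(mse_total_inside_sales, mse_total_food_service, mse_diesel, mse_unleaded),
  RMSE = c(rmse_total_inside_sales, rmse_total_food_service, rmse_diesel, rmse_unleaded),
  R2   = c(r2_total_inside_sales, r2_total_food_service, r2_diesel, r2_unleaded)
)
print(svr_metrics)
```

Keep in mind that these metrics are computed on scaled targets, so the RMSE values are not directly comparable to the VAR table’s unscaled units.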

Extreme Gradient Boosting (XGBoost)

Data Preparation

# get desired column names from EDA
fixed_cnames <- colnames(read_csv(paste0("q_data.csv")) %>%
                           mutate(men_toilet_count = NA,
                                  .after = self_check_out) %>%
                           select(-rv_fueling_positions))[c(1:39, 41, 42, 40, 43:52)]
## Rows: 37 Columns: 52
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): lottery, freal, bonfire_grill, pizza, cinnabon, godfathers_pizza, ...
## dbl (26): open_year, square_feet, fDoor_count, years_since_last_project, par...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q_data <- read_csv(paste0("qualitative_data_msba.csv")) %>%
  # Remove row index and duplicated columns
  select(-c(1, `RV Lanes Fueling Positions`, `Hi-Flow Lanes Fueling Positions`)) %>%
  # properly encode "None"
  mutate(
    across(
      where(~any(grepl("^N/?A$", ., ignore.case = TRUE))),
      ~replace(., grepl("^N/?A$", ., ignore.case = TRUE), "None")
    )
  ) %>%
  rename_with(~fixed_cnames) %>%
  relocate(site_id) %>%
  # omitting zero-variance variables
  select(-c(fDoor_count, godfathers_pizza, diesel, car_wash, 
            ev_charging, non_24_hour, self_check_out))
## New names:
## Rows: 37 Columns: 55
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (27): Lottery, Freal, Bonfire Grill, Pizza, Cinnabon, Godfather’s Pizza,... dbl
## (28): ...1, Open Year, Square Feet, Front Door Count, Years Since Last P...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Calculate standardized day id
day_id_df <- tibble(date = seq(as_date("2021-01-01"), as_date("2023-12-31"), "1 day")) %>%
  # Calculate week_id
  mutate(week_id = yearweek(date, week_start = 5) %>% format("%V") %>% as.numeric(),
         # since the first day of fiscal year 2022 is actually in 2021, special logic must be 
         # applied to identify the beginning of the year
         x = case_when(lag(week_id, default = 52) == 52 & week_id == 1 ~ 1),
         year = 2020 + zoo::rollapplyr(x, width = n(), FUN = sum, na.rm = TRUE, partial = TRUE)) %>%
  group_by(year) %>%
  mutate(day_id = row_number()) %>%
  select(-x) %>%
  ungroup()

t_series <- read_csv(paste0("t_series.csv")) %>%
  # remove missing store
  filter(site_id != 23065) %>%
  relocate(site_id, date) %>%
  arrange(site_id, date) %>%
  mutate(id = row_number(),
         .before = 1) %>%
  left_join(day_id_df %>%
              select(date, day_id), "date") %>%
  group_by(site_id) %>%
  mutate(first_day_id = first(day_id)) %>%
  ungroup() %>%
  arrange(first_day_id, site_id) %>%
  group_by(site_id) %>%
  # Encode an alternative day_id which can exist in 2 years
  mutate(day_id2 = purrr::accumulate(day_id, ~ifelse(.x < .y, .y, .y + 364)),
         date2 = as_date(as.numeric(as_date("2021-01-01")) + (day_id2 - 1))) %>%
  ungroup() %>%
  select(-c(first_day_id))
## Rows: 13908 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): day_of_week, holiday, day_type
## dbl  (6): week_id, inside_sales, food_service_sales, diesel_sales, unleaded_...
## date (2): open_date, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
merged_data <- t_series %>%
  left_join(q_data,
            "site_id") %>%
  arrange(site_id, date)

Maverik expressed the importance of aligning days in a standardized manner, which is why week_id is included and why I created a day_id.

XGBoost can only handle numeric data, so I convert the categorical variables to dummy variables.

dummy_targ <- merged_data %>%
  # get character columns
  select(where(is.character)) %>%
  colnames()

# create dummy variables
dummy_df <- dummyVars(~.,
                      merged_data %>%
                        select(all_of(dummy_targ))) %>%
  predict(merged_data %>%
            select(all_of(dummy_targ))) %>%
  as_tibble() %>%
  # Ensure one level for each column is left out
  select(!matches("no(ne)?$"), day_of_weekSunday, day_typeWEEKEND, `traditional_forecourt_layoutIn-Line`)

Including features derived from an ARIMA model may prove to be helpful in an XGBoost regressor. The innovation residuals and fitted values are calculated for each sales metric for each site.

mts <- merged_data %>%
  # Convert to wide form
  pivot_longer(inside_sales:unleaded_sales,
               names_to = "met",
               values_to = "sales") %>%
  # create tsibble grouped on site and sales metric
  as_tsibble(index = date, key = c(site_id, met))
# fit arima model
mts_fit <- mts %>%
  model(
    arima = ARIMA(sales, stepwise = FALSE)
  )
# get desired metrics
mts_aug <- mts_fit %>%
  augment() %>%
  as_tibble() %>%
  pivot_wider(id_cols = c(site_id, date),
              names_from = met,
              values_from = c(.fitted, .innov)) %>%
  # rename output
  rename_with(~gsub("(\\..*?)_(.*)", "\\2\\1", .),
              contains("."))

The dummy variables and ARIMA features are incorporated into the model data. Other features were experimented with and omitted in final iterations. XGBoost doesn’t accept date columns directly, so their components are split into separate numeric columns.

mdf <- merged_data %>% 
  arrange(site_id, date) %>%
  relocate(id, site_id, day_id) %>%
  relocate(open_year, .after = day_id) %>%
  # Split dates 
  mutate(open_month = month(open_date),
         open_day = day(open_date),
         .after = open_year) %>% 
  # join ARIMA features
  left_join(mts_aug, c("site_id", "date")) %>%
  mutate(year = year(date),
         month = month(date),
         day = day(date),
         .after = date) %>%
  group_by(site_id) %>%
  # include lagged values of the sales features
  mutate(across(contains("sales"),
                list(l1 = ~. - lag(.),
                     l7 = ~. - lag(., 7)))) %>%
  ungroup() %>%
  # remove undesired columns
  select(-c(date, open_date, day_id2, date2, all_of(dummy_targ))) %>%
  bind_cols(dummy_df)

I decided to hold out entire sites when creating the train test splits so that I can plot entire sales history later on as opposed to random, disjoint dates.

set.seed(1234) 
train_sites <- sample(unique(mdf$site_id), 30)


mdf %>%
  filter(!site_id %in% train_sites) %>%
  pivot_longer(inside_sales:unleaded_sales,
               names_to = "met",
               values_to = "sales") %>%
  relocate(met, sales, .after = open_year) %>%
  ggplot() +
  geom_line(aes(day_id, sales, color = met)) +
  lemon::facet_rep_wrap(~site_id, repeat.tick.labels = TRUE, scales = "free_y", ncol = 2) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(title = "Sales history of hold-out sites")

Since predictions need to be made for four separate targets, the appropriate train and test sets have to be defined. XGBoost requires a special DMatrix class for objects passed to the watchlist, so a separate DMatrix is made for each target variable.

# Train
train_all <- mdf %>%
  filter(site_id %in% train_sites)

train_is <- train_all %>%
  select(-c(id, site_id, inside_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_fs <- train_all %>%
  select(-c(id, site_id, food_service_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_d <- train_all %>%
  select(-c(id, site_id, diesel_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_u <- train_all %>%
  select(-c(id, site_id, unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))

train_df <- train_all %>%
  select(-c(id, site_id, inside_sales:unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))

# Test
test_all <- mdf %>%
  filter(!site_id %in% train_sites)

test_is <- test_all %>%
  select(-c(id, site_id, inside_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_fs <- test_all %>%
  select(-c(id, site_id, food_service_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_d <- test_all %>%
  select(-c(id, site_id, diesel_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_u <- test_all %>%
  select(-c(id, site_id, unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))

test_df <- test_all  %>%
  select(-c(id, site_id, inside_sales:unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))

# Prepare modeling data
train_dmat_is <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$inside_sales)
train_dmat_fs <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$food_service_sales)
train_dmat_d <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$diesel_sales)
train_dmat_u <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$unleaded_sales)

test_dmat_is <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$inside_sales)
test_dmat_fs <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$food_service_sales)
test_dmat_d <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$diesel_sales)
test_dmat_u <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$unleaded_sales)

Modeling

To decide on a final model, I’ll define a tuning grid of 144 hyperparameter combinations to use in 3-fold cross-validation. I will tune on inside_sales first and then reuse the best parameters for the other targets.

# Train control
tc <- trainControl(method = "cv",
                   # 3-fold cv
                   number = 3,
                   allowParallel = FALSE,
                   returnResamp = "all",
                   returnData = TRUE,
                   savePredictions = TRUE)

# Hyperparameters ----
params <- expand.grid(eta = c(0.05, 0.01),
                      subsample = c(0.5, 0.8, 1),
                      gamma = c(2, 5, 7),
                      max_depth = c(3, 12),
                      min_child_weight = c(2, 5),
                      colsample_bytree = c(0.5, 0.9),
                      nrounds = 1000)

Parallel processing was utilized to save time. Strangely, I repeatedly encountered an error stating “Inconsistent values between best_tune and xgb.attr”, but only when running on a Mac; it never occurred on a Windows machine. Because XGBoost training is stochastic, the error did not always occur and depended in part on the random seed, so retrying after a failure could eventually yield a valid result. I used tryCatch in the foreach loop to keep retrying until a model was successfully trained.

# set seed
set.seed(123)
# use 8 cores
cl <- makeForkCluster(8)
registerDoParallel(cl)
# record start time
(xtime1 <- Sys.time())

# Initiate loop
xgb_par <- foreach(i = 1:nrow(params),
                   .packages = c("xgboost", "caret", "tidyverse"),
                   .verbose = TRUE#, .errorhandling = "pass"
) %dopar% {
  
  # foreach doesn't import xgb.DMatrix objects so they must be defined again here
  train_dmat_is <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$inside_sales)
  test_dmat_is <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$inside_sales)
  set.seed(123)  
  
  # Initialize failed state
  xgb_model <- "failed"
  
  # keep trying until success
  while(class(xgb_model) == "character"){
    xgb_model <- tryCatch({
      train(
        x = as.data.frame(train_is),
        y = train_all$inside_sales,
        method = "xgbTree",
        watchlist = list(train = train_dmat_is, test = test_dmat_is),
        tuneGrid = params[i,],
        trControl = tc,
        early_stopping_rounds = 20,
        verbose = 0
      )
    }, error = function(e) {
      "failed"
    })
  }
  gc()
  
  xgb_model
  
}

# stop cluster
stopCluster(cl)
stopImplicitCluster()
# record run-time
xtime2 <- Sys.time()
xtime2 - xtime1
gc()

CV Evaluation

For each model, I make predictions on the test and train set, obtain performance metrics, combine them all into a dataframe, and take the best results based on each metric.

# Evaluate results ----
tune_res <- lapply(xgb_par,
                   \(x){
                     
                     if("error" %in% class(x)){
                       tibble()
                     } else {
                       train_preds <- predict(x, train_is)
                       train_res <- postResample(train_preds, train_all$inside_sales) %>%
                         as.list() %>%
                         as_tibble() %>%
                         rename_with(~paste0("train_", .))
                       
                       test_preds <- predict(x, test_is)
                       test_res <- postResample(test_preds, test_all$inside_sales) %>%
                         as.list() %>%
                         as_tibble() %>%
                         rename_with(~paste0("test_", .))
                       
                       x$results %>%
                         as_tibble() %>%
                         mutate(train_res,
                                test_res,
                                train_preds = list(train_preds),
                                test_preds = list(test_preds))
                     }
                     
                     
                   }) %>%
  list_rbind() %>%
  relocate(eta:nrounds, .after = test_MAE) %>%
  # determine severity of overfitting by calculating difference between test and train metrics
  mutate(id = row_number(),
         rmse_diff = test_RMSE - train_RMSE,
         rsq_diff = train_Rsquared - test_Rsquared,
         .before = 1)

head(tune_res)
# Get hyperparameters of the best 3 models by each metric: RMSE and Rsquared
# (CV, train, and test), plus the overfitting gaps

# Helper: top 3 rows of tune_res by a given metric
top3_by <- function(metric, descending = FALSE) {
  tune_res %>%
    select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
           eta:colsample_bytree) %>%
    arrange(if (descending) desc(.data[[metric]]) else .data[[metric]]) %>%
    slice_head(n = 3) %>%
    mutate(tvar = metric, .before = 1)
}

bind_rows(
  top3_by("RMSE"),
  top3_by("Rsquared", descending = TRUE),
  top3_by("test_RMSE"),
  top3_by("test_Rsquared", descending = TRUE),
  top3_by("rmse_diff"),
  top3_by("rsq_diff")
)
# Feature importance
varImp(xgb_par[[22]])
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 124)
## 
##                           Overall
## food_service_sales        100.000
## food_service_sales.fitted  77.725
## unleaded_sales             11.188
## x1_mile_pop                10.804
## pizzaYes                   10.769
## x5_min_pop                  8.152
## unleaded_sales.fitted       7.830
## x1_mile_income              5.163
## open_month                  3.846
## day_id                      3.768
## diesel_sales.fitted         3.610
## parking_spaces              3.429
## inside_sales_l1             3.384
## diesel_sales                3.330
## inside_sales_l7             3.081
## x7_min_pop                  2.899
## square_feet                 2.756
## x1_2_mile_pop               2.393
## open_day                    2.054
## day_typeWEEKDAY             1.905

Model 22 yielded the best RMSE and Rsquared in CV, so its hyperparameters will be reused to predict the remaining sales metrics.

# Food service
set.seed(123)
xgb_fs <- "failed"

while(class(xgb_fs) == "character"){
  xgb_fs <- tryCatch({
    train(
      x = as.data.frame(train_fs),
      y = train_all$food_service_sales,
      method = "xgbTree",
      watchlist = list(train = train_dmat_fs, test = test_dmat_fs),
      tuneGrid = params[22,],
      trControl = tc,
      early_stopping_rounds = 20,
      verbose = 0
    )
  }, error = function(e) {
    "failed"
  })
}

xgb_fs$results

# Diesel
xgb_d <- "failed"

while(class(xgb_d) == "character"){
  xgb_d <- tryCatch({
    train(
      x = as.data.frame(train_d),
      y = train_all$diesel_sales,
      method = "xgbTree",
      watchlist = list(train = train_dmat_d, test = test_dmat_d),
      tuneGrid = params[22,],
      trControl = tc,
      early_stopping_rounds = 20,
      verbose = 0
    )
  }, error = function(e) {
    "failed"
  })
}

# Unleaded
xgb_u <- "failed"

while(class(xgb_u) == "character"){
  xgb_u <- tryCatch({
    train(
      x = as.data.frame(train_u),
      y = train_all$unleaded_sales,
      method = "xgbTree",
      watchlist = list(train = train_dmat_u, test = test_dmat_u),
      tuneGrid = params[22,],
      trControl = tc,
      early_stopping_rounds = 20,
      verbose = 0
    )
  }, error = function(e) {
    "failed"
  })
}
# Food service
train_pred_fs <- predict(xgb_fs, train_fs)
test_pred_fs <- predict(xgb_fs, test_fs)
postResample(train_pred_fs, train_all$food_service_sales)
##       RMSE   Rsquared        MAE 
## 33.9781412  0.9956942 24.4870098
postResample(test_pred_fs, test_all$food_service_sales)
##        RMSE    Rsquared         MAE 
## 136.1773057   0.8471135 115.7774812
varImp(xgb_fs)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 112)
## 
##                            Overall
## inside_sales               100.000
## diesel_sales                27.762
## x1_mile_pop                 18.108
## x1_2_mile_pop               17.663
## day_typeWEEKDAY              9.309
## x5_min_pop                   8.728
## rv_lanes_fueling_positions   7.136
## unleaded_sales               6.593
## parking_spaces               5.903
## x1_mile_income               3.915
## hi_flow_fueling_positions    3.799
## rv_lanesYes                  3.585
## day_typeWEEKEND              2.771
## x1_2_mile_income             2.758
## inside_sales.fitted_l1       2.693
## day_id                       2.662
## food_service_sales_l1        2.432
## x7_min_pop                   2.407
## food_service_sales_l7        2.144
## x1_2_mile_emp                2.067
# Diesel
train_pred_d <- predict(xgb_d, train_d)
test_pred_d <- predict(xgb_d, test_d)
postResample(train_pred_d, train_all$diesel_sales)
##        RMSE    Rsquared         MAE 
## 737.3965079   0.9864328 401.7019454
postResample(test_pred_d, test_all$diesel_sales)
##         RMSE     Rsquared          MAE 
## 1313.6496612    0.1816536  991.0482281
varImp(xgb_d)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 112)
## 
##                               Overall
## hi_flow_fueling_positions     100.000
## men_toilet_count               46.937
## men_urinal_count               46.313
## inside_sales                   20.086
## women_toilet_count             16.409
## food_service_sales             11.805
## rv_lanes_fueling_positions      8.109
## parking_spaces                  6.129
## day_typeWEEKDAY                 5.394
## unleaded_sales                  4.223
## women_sink_count                4.117
## day_id                          3.398
## day_typeWEEKEND                 3.394
## x1_mile_pop                     3.256
## traditional_fueling_positions   2.918
## x1_mile_emp                     2.727
## x1_2_mile_emp                   2.725
## hi_flow_rv_lanesYes             2.238
## x5_min_emp                      2.082
## diesel_sales_l7                 1.827
# Unleaded
train_pred_u <- predict(xgb_u, train_u)
test_pred_u <- predict(xgb_u, test_u)
# compare the unleaded predictions (not the diesel ones) to unleaded sales
postResample(train_pred_u, train_all$unleaded_sales)
postResample(test_pred_u, test_all$unleaded_sales)
varImp(xgb_u)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 112)
## 
##                    Overall
## diesel_sales        100.00
## men_toilet_count     92.84
## parking_spaces       87.53
## inside_sales         83.51
## x5_min_emp           67.27
## x1_mile_pop          61.21
## x5_min_pop           40.68
## x7_min_emp           39.29
## unleaded_sales_l7    35.06
## x1_mile_emp          28.66
## food_service_sales   28.39
## women_sink_count     25.79
## day_id               24.65
## x5_min_inc           23.39
## x1_mile_income       21.24
## unleaded_sales_l1    21.13
## x7_min_inc           20.50
## x1_2_mile_emp        14.88
## bonfire_grillYes     13.28
## x7_min_pop           11.66

XGBoost Results

What works for one sales metric does not necessarily work for the others. While it is computationally expensive, tuning the hyperparameters separately for each sales metric would likely yield the best results. Unsurprisingly, each sales metric is fairly predictive of the others. men_toilet_count made an unexpectedly large contribution, which may indicate the existence of some other confounding explanatory variable (e.g., overall store size or traffic).
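
A per-target tuning pass could be sketched as follows. This is a hypothetical sketch rather than code that was run: it assumes the `params` grid, the `tc` control object, the per-target feature matrices (`train_fs`, `train_d`, `train_u`), and `train_all` from the earlier chunks.

```r
# Hypothetical sketch: tune the full grid separately for each sales target
# instead of reusing the inside-sales hyperparameters (row 22 of params)
targets <- list(
  food_service_sales = train_fs,
  diesel_sales       = train_d,
  unleaded_sales     = train_u
)

xgb_per_target <- lapply(names(targets), function(tgt) {
  set.seed(123)
  train(
    x = as.data.frame(targets[[tgt]]),
    y = train_all[[tgt]],
    method    = "xgbTree",
    tuneGrid  = params,   # full grid rather than params[22,]
    trControl = tc,
    verbose   = 0
  )
})
names(xgb_per_target) <- names(targets)
```

Each element would then hold the best hyperparameters for its own target, at roughly three times the tuning cost of the single shared configuration.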

Results

We compared model performance using the RMSE of each model. Across the different target variables, the SVR model performed best on this dataset, with the following final RMSE values:
- inside_sales = 0.3448517
- food_service = 0.2782577
- diesel = 0.567455
- unleaded = 0.8332988

We used only the time series data for this model because the qualitative data was not strongly correlated with the targets, while the target variables were strongly correlated with one another. Hence, Maverik should consider using the SVR model for its forecasting. The SVR model provides sufficient evidence that Maverik's goal can be achieved.
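
The per-target selection step can be illustrated with a small table. The SVR row uses the final RMSEs reported above; the other rows are hypothetical placeholders added purely to demonstrate picking the minimum-RMSE model per target, not measured results:

```r
library(dplyr)
library(tidyr)
library(tibble)

# SVR row: final RMSEs from above; VAR and Prophet rows: HYPOTHETICAL
# placeholder values used only to illustrate the selection logic
rmse_tbl <- tribble(
  ~model,    ~inside_sales, ~food_service, ~diesel,  ~unleaded,
  "SVR",     0.3448517,     0.2782577,     0.567455, 0.8332988,
  "VAR",     0.60,          0.55,          0.80,     0.95,      # placeholder
  "Prophet", 0.55,          0.50,          0.75,     0.90       # placeholder
)

# pivot long and keep the lowest-RMSE model for each target
rmse_tbl %>%
  pivot_longer(-model, names_to = "target", values_to = "rmse") %>%
  group_by(target) %>%
  slice_min(rmse, n = 1) %>%
  ungroup()
```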

While we were able to devise these models, more data and time would likely have produced better results. There is untapped potential in the other models that could be unlocked with further experimentation, though some require massive computational resources to ensure the best parameters are utilized.

Member Contribution